|
Distributed fault localization framework for high performance computing
GAO Jian, YU Kang, QING Peng, WEI Hongmei
Journal of Computer Applications
2018, 38 (1):
44-49.
DOI: 10.11772/j.issn.1001-9081.2017071948
To solve the problem of high difficulty and poor real-time in fault localization for high performance computing system, a Message-Passing based Fault Localization (MPFL) framework was proposed, which included Tree-based Fault Detection (TFD) and Tree-based Fault Analysis (TFA) algorithms. Firstly, when the parallel application was initialized, the Fault Localization Tree (FLT) was obtained by logically dividing all the nodes participating in the computing, and the fault localization tasks were distributed to different nodes. Secondly, if the abnormal state of a node was detected by system components such as message-passing library and operating system, the TFD algorithm was used to analyze the FLT structure, and the node responsible for receiving the abnormal state was selected according to factors such as load balancing and performance cost. Finally, the fault was derived from the received abnormal state, which was reasoned by the node that used TFA algorithm. The rule-based event correlation and the lightweight active probing based on message-passing were used in TFA algorithm, and the accuracy of fault analysis was improved by combining these two approaches. The experimental evaluation was performed on a typical cluster, which demonstrated the capability of MPFL by locating the shutdown simulation nodes. The experimental results on the NPB-FT and NPB-IS benchmarks show that the MPFL framework has excellent performance on fault localization capability and cost saving.
Reference |
Related Articles |
Metrics
|
|